By transferring knowledge from large, diverse, task-agnostic datasets, modern machine learning models can solve specific downstream tasks either zero-shot or with small task-specific datasets to a high level of performance. While this capability has been demonstrated in other fields such as computer vision, natural language processing or speech recognition, it remains to be shown in robotics, where the generalization capabilities of the models are particularly critical due to the difficulty of collecting real-world robotic data. We argue that one of the keys to the success of such general robotic models lies with open-ended task-agnostic training, combined with high-capacity architectures that can absorb all of the diverse, robotic data. In this paper, we present a model class, dubbed Robotics Transformer, that exhibits promising scalable model properties. We verify our conclusions in a study of different model classes and their ability to generalize as a function of the data size, model size, and data diversity based on a large-scale data collection on real robots performing real-world tasks. The project's website and videos can be found at robotics-transformer.github.io
translated by 谷歌翻译
Temporal video segmentation and classification have been advanced greatly by public benchmarks in recent years. However, such research still mainly focuses on human actions, failing to describe videos in a holistic view. In addition, previous research tends to pay much attention to visual information yet ignores the multi-modal nature of videos. To fill this gap, we construct the Tencent `Ads Video Segmentation'~(TAVS) dataset in the ads domain to escalate multi-modal video analysis to a new level. TAVS describes videos from three independent perspectives as `presentation form', `place', and `style', and contains rich multi-modal information such as video, audio, and text. TAVS is organized hierarchically in semantic aspects for comprehensive temporal video segmentation with three levels of categories for multi-label classification, e.g., `place' - `working place' - `office'. Therefore, TAVS is distinguished from previous temporal segmentation datasets due to its multi-modal information, holistic view of categories, and hierarchical granularities. It includes 12,000 videos, 82 classes, 33,900 segments, 121,100 shots, and 168,500 labels. Accompanied with TAVS, we also present a strong multi-modal video segmentation baseline coupled with multi-label class prediction. Extensive experiments are conducted to evaluate our proposed method as well as existing representative methods to reveal key challenges of our dataset TAVS.
translated by 谷歌翻译
In this work we propose a novel token-based training strategy that improves Transformer-Transducer (T-T) based speaker change detection (SCD) performance. The conventional T-T based SCD model loss optimizes all output tokens equally. Due to the sparsity of the speaker changes in the training data, the conventional T-T based SCD model loss leads to sub-optimal detection accuracy. To mitigate this issue, we use a customized edit-distance algorithm to estimate the token-level SCD false accept (FA) and false reject (FR) rates during training and optimize model parameters to minimize a weighted combination of the FA and FR, focusing the model on accurately predicting speaker changes. We also propose a set of evaluation metrics that align better with commercial use cases. Experiments on a group of challenging real-world datasets show that the proposed training method can significantly improve the overall performance of the SCD model with the same number of parameters.
translated by 谷歌翻译
While recent research advances in speaker diarization mostly focus on improving the quality of diarization results, there is also an increasing interest in improving the efficiency of diarization systems. In this paper, we propose a multi-stage clustering strategy, that uses different clustering algorithms for input of different lengths. Specifically, a fallback clusterer is used to handle short-form inputs; a main clusterer is used to handle medium-length inputs; and a pre-clusterer is used to compress long-form inputs before they are processed by the main clusterer. Both the main clusterer and the pre-clusterer can be configured with an upper bound of the computational complexity to adapt to devices with different constraints. This multi-stage clustering strategy is critical for streaming on-device speaker diarization systems, where the budgets of CPU, memory and battery are tight.
translated by 谷歌翻译
解释深度卷积神经网络最近引起了人们的关注,因为它有助于了解网络的内部操作以及为什么它们做出某些决定。显着地图强调了与网络决策的主要连接的显着区域,是可视化和分析计算机视觉社区深层网络的最常见方法之一。但是,由于未经证实的激活图权重的建议,这些图像没有稳固的理论基础,并且未能考虑每个像素之间的关系,因此现有方法生成的显着图不能表示图像中的真实信息。在本文中,我们开发了一种基于类激活映射的新型事后视觉解释方法,称为Shap-Cam。与以前的基于梯度的方法不同,Shap-Cam通过通过Shapley值获得每个像素的重要性来摆脱对梯度的依赖。我们证明,Shap-Cam可以在解释决策过程中获得更好的视觉性能和公平性。我们的方法在识别和本地化任务方面的表现优于以前的方法。
translated by 谷歌翻译
需求估计在动态定价中起着重要的作用,在动态定价中,可以通过基于需求曲线最大化收入来获得最佳价格。在在线酒店预订平台中,房间的需求或占用率随着房间类型而变化,随着时间的推移变化,因此获得准确的占用估算是一项挑战。在本文中,我们提出了一种新颖的酒店需求功能,该功能明确地模拟了对占用预测需求需求的价格弹性,并设计了价格弹性预测模型,以了解各种影响因素的动态价格弹性系数。我们的模型由精心设计的弹性学习模块组成,以减轻内生性问题,并在多任务框架中接受培训以解决数据稀疏性。我们在现实世界数据集上进行了全面的实验,并验证方法优于最先进的基准,以实现占用预测和动态定价。
translated by 谷歌翻译
可变形的模型对于3D面的统计建模至关重要。以前的可变形模型的作品主要集中在大规模的面部几何形状上,但忽略了面部细节。本文通过学习一种结构含义的可编辑形态模型(SEMM)来增强形象模型。 SEMM基于皱纹线的距离字段引入了细节结构表示,并以细节位移进行建模,以建立更好的对应关系并实现对皱纹结构的直观操纵。此外,SEMM还引入了两个转换模块,以将表达式的融合体权重和年龄值转化为潜在空间的变化,从而在维持身份的同时可以有效的语义细节编辑。广泛的实验表明,所提出的模型紧凑地表示面部细节,在定性和定量上表达动画中的先前方法,并实现了面部细节的有效年龄编辑和皱纹线编辑。代码和模型可在https://github.com/gerwang/facial-detail-manipulation上找到。
translated by 谷歌翻译
由于空间分辨率的巨大改进,4K内容可以为消费者提供更严肃的视觉体验。但是,由于分辨率扩大和特定的扭曲,现有的盲图质量评估(BIQA)方法不适合原始和升级的4K内容物。在本文中,我们提出了一个针对4K内容的深度学习的BIQA模型,一方面可以识别True和pseudo 4K内容,另一方面可以评估其感知视觉质量。考虑到高空间分辨率可以代表更丰富的高频信息的特征,我们首先提出了基于灰色级别的共发生矩阵(GLCM)的纹理复杂度度量,以从4K图像中选择三个代表性图像贴片,这可以减少计算复杂性,被证明对通过实验的总体质量预测非常有效。然后,我们从卷积神经网络(CNN)的中间层中提取不同种类的视觉特征,并将它们集成到质量感知的特征表示中。最后,使用两个多层感知(MLP)网络用于将质量感知功能映射到类概率和每个贴片的质量分数中。总体质量指数是通过平均贴片结果汇总获得的。提出的模型通过多任务学习方式进行了训练,我们引入了不确定性原理,以平衡分类和回归任务的损失。实验结果表明,所提出的模型的表现均优于所有4K内容质量评估数据库中的BIQA指标。
translated by 谷歌翻译
Hybrid unmanned aerial vehicles (UAVs) integrate the efficient forward flight of fixed-wing and vertical takeoff and landing (VTOL) capabilities of multicopter UAVs. This paper presents the modeling, control and simulation of a new type of hybrid micro-small UAVs, coined as lifting-wing quadcopters. The airframe orientation of the lifting wing needs to tilt a specific angle often within $ 45$ degrees, neither nearly $ 90$ nor approximately $ 0$ degrees. Compared with some convertiplane and tail-sitter UAVs, the lifting-wing quadcopter has a highly reliable structure, robust wind resistance, low cruise speed and reliable transition flight, making it potential to work fully-autonomous outdoor or some confined airspace indoor. In the modeling part, forces and moments generated by both lifting wing and rotors are considered. Based on the established model, a unified controller for the full flight phase is designed. The controller has the capability of uniformly treating the hovering and forward flight, and enables a continuous transition between two modes, depending on the velocity command. What is more, by taking rotor thrust and aerodynamic force under consideration simultaneously, a control allocation based on optimization is utilized to realize cooperative control for energy saving. Finally, comprehensive Hardware-In-the-Loop (HIL) simulations are performed to verify the advantages of the designed aircraft and the proposed controller.
translated by 谷歌翻译
Graph Neural Networks (GNNs), originally proposed for node classification, have also motivated many recent works on edge prediction (a.k.a., link prediction). However, existing methods lack elaborate design regarding the distinctions between two tasks that have been frequently overlooked: (i) edges only constitute the topology in the node classification task but can be used as both the topology and the supervisions (i.e., labels) in the edge prediction task; (ii) the node classification makes prediction over each individual node, while the edge prediction is determinated by each pair of nodes. To this end, we propose a novel edge prediction paradigm named Edge-aware Message PassIng neuRal nEtworks (EMPIRE). Concretely, we first introduce an edge splitting technique to specify use of each edge where each edge is solely used as either the topology or the supervision (named as topology edge or supervision edge). We then develop a new message passing mechanism that generates the messages to source nodes (through topology edges) being aware of target nodes (through supervision edges). In order to emphasize the differences between pairs connected by supervision edges and pairs unconnected, we further weight the messages to highlight the relative ones that can reflect the differences. In addition, we design a novel negative node-pair sampling trick that efficiently samples 'hard' negative instances in the supervision instances, and can significantly improve the performance. Experimental results verify that the proposed method can significantly outperform existing state-of-the-art models regarding the edge prediction task on multiple homogeneous and heterogeneous graph datasets.
translated by 谷歌翻译